1. Introduction


Rapid and robust scene understanding is a critically important goal for enhanced
robot autonomy; however, the interpretation of image scenes that change in space
and time due to varying environmental conditions can pose serious challenges
for computer vision processes, such as those associated with vision-based place
recognition and navigation. Such challenges can also extend to interpreting
changing scenes due to visual motion of objects within the field of view. As Tu et
al. discuss, adverse weather or illumination conditions can make the appearance of
moving objects unclear, so that identifying moving objects in outdoor environments
becomes more difficult for robot vision systems. As an example, time- and space-
dependent environmental effects on image contrast and resolution can be brought
about by rain and snow, fog, smoke, obscurants, or other changes in
lighting and visibility. Alternatively, visually degraded or blurred images can
result from rapid object movements or long exposure times in both
single frames and sequences of recorded images.

With regard to lighting variations, Andreev et al. described a method to estimate
the effect of space and time changes in scene illumination on the optical flow field
in a movie. Their research is unusual because many of the optical flow
approaches used to detect motion of objects in a scene, although helpful in a variety
of applications, do not address space and time scaling and analysis. Also, camera
motion may introduce some unmanageable
artifacts with the gradient-based optical flow approach if it is not augmented by
more sophisticated spatio-temporal analyses. Other difficulties in image motion
analysis can arise if objects in the scene have reflections; when new objects appear
or old ones disappear; or when describing transparent motions, for example, the
motion of objects behind smoke, foliage, or a fence. Here, in addition to capturing
the motion of individual objects, it appears necessary to capture the relative motion
between individual objects and the time and space resolutions of the information
being collected.

To help mitigate some of the difficulties associated with the measurement and
analysis of changing scenes, I propose that it is important to consider the space and
time scales of image data from the very beginning of the data collection process.
Incorporating key space and time scale information at the time of recording not
only helps to systematically characterize the measured data but can also provide the
future analyst with a top-down approach to determine what analysis or computer
vision tasks are feasible with the available data. This kind of enhanced analysis and
decision making may also be applied to future autonomous systems. Alternatively, if
an image analysis or computer vision objective is known in advance, then it would be useful
to determine what image resolutions are needed to enable more intelligent data
collection. If neglected, the end user advantages provided by space and time scale
characterization may be irretrievably lost.

In  this  report,  I  begin  to  explore  the  space  and  time  scales  of  image  data  as  they  are 
related  to  the  measurement  and  analysis  of  changing  image  scenes,  and  whether 
scene  variations  are  due  to  environmental  conditions  or  the  motion  of  objects  within 
the  field  of  view  or  both. 

2.  Space  and  Time  Scales 


2.1  Primary  Space  and  Time  Scales 

This  section  provides  a  framework  to  help  categorize  the  spatial  and  temporal 
properties  of  image  data.  Relevant  time  scales  include,  but  are  not  limited  to,  the 
shutter  exposure  time,  the  time  interval  between  frames,  time  over  which  images 
are  captured  in  a  sequence,  and  the  time  over  which  there  is  visual  motion  of  objects 
inside  the  field  of  view.  Space  scales  include,  but  are  not  limited  to,  the  field  of 
view,  depth  of  view,  image  resolution,  pixel  size,  pixel  separation,  color  matrix 
size,  scene  color  or  shading  variations  as  a  function  of  spatial  location,  spatial 
smearing  of  moving  elements  in  the  field  of  view,  spatial  smearing  due  to  optical 
turbulence and environmental/weather effects, and smearing of textures in the field
of  view.  Naturally,  the  smearing  of  elements  in  the  field  of  view  can  also  be  related 
to  the  temporal  resolution  of  the  image  data.  Figure  1  illustrates  the  primary  space 
and  time  scales,  which  can  be  used  to  describe  the  various  spatial  and  temporal 
resolutions of objects and/or activities in a recorded image scene. Here, Δs and Δt
represent changes in position and time, respectively.
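As a rough illustration of how spatial smearing can be tied to temporal resolution, the sketch below estimates the smear length of a moving element during a single exposure. It assumes a simple linear model (smear = speed x exposure time, converted to pixels); the function name, parameter names, and example values are hypothetical and are not taken from the report.

```python
def smear_in_pixels(speed_m_per_s, exposure_s, metres_per_pixel):
    """Approximate motion smear length, in pixels, for one exposure.

    Assumes the element moves at constant speed parallel to the image
    plane and that one pixel spans `metres_per_pixel` at the object.
    """
    if exposure_s < 0 or metres_per_pixel <= 0:
        raise ValueError("exposure must be >= 0 and pixel footprint > 0")
    distance_m = speed_m_per_s * exposure_s  # distance moved while the shutter is open
    return distance_m / metres_per_pixel


# Hypothetical values: 10 m/s object, 1/100 s exposure, 5 cm per pixel.
example_smear = smear_in_pixels(10.0, 0.01, 0.05)
print(f"smear of about {example_smear:.1f} pixels")
```

Under these assumptions, halving the exposure time halves the smear, which is one way the temporal and spatial scales in Fig. 1 interact.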


Fig. 1 Primary space (s) and time (t) scales

2.2 Image Resolution and Field of View


To  begin  to  demonstrate  the  impact  of  varying  image  resolution  and  field  of  view 
on  scene  analysis,  try  to  identify  the  3  dome  shapes  shown  in  Fig.  2.  Without  some 
additional  information  related  to  the  object  size,  texture,  or  shape  in  relationship  to 
other  objects  that  may  be  visible  in  an  expanded  field  of  view,  it  is  difficult  to 
correctly  identify  and  label  these  familiar  images.  Furthermore,  distinguishing 
various  image  details,  even  in  ideal  conditions  with  regard  to  lighting  and  visibility, 
can depend on the image contrast and resolution, where image resolution here refers
to the number of pixels that comprise the image data input. Interestingly,
Torralba reported that, for human vision, the brain can comprehend the gist of an
image scene remarkably quickly, whether low resolution or high resolution images
are used. He concluded that images at a resolution of 32 x 32 color pixels can
provide  an  observer  enough  information  to  correctly  identify  the  semantic  category 
and  general  layout  of  an  indoor/outdoor  scene.  For  example,  in  Fig.  2  the  main 
“dome”  category  for  these  low  resolution  images  is  identifiable.  However,  if  we 
consider  Fig.  3,  which  contains  expanded  fields  of  view  and  higher  resolution 
images  from  which  the  elements  in  Fig.  2  were  taken,  then  the  building  domes  and 
many  additional  image  details  can  be  identified  over  a  much  wider  range  of  spatial 
scales. 


Fig. 2 Can you correctly identify these images? Image resolution: a) 30 x 20 pixels,
b) 30 x 14 pixels, and c) 30 x 16 pixels.


Fig. 3 Higher resolution image scenes corresponding to the 3 shapes shown in Fig. 2.
a) photo: Taj Mahal (desktopdress.com), b) photo: US Capitol Dome (Library of Congress),
and c) photo: nuclear power plant, Bushehr, Iran (Behrouz Mehri/AFP/Getty Images).

Let’s examine image resolution more closely. Can you identify the 2 extracted
objects shown in Fig. 4 without some additional context? What if we look at the
complete image (Fig. 5) from which the objects were taken? In this case, at low
resolution, it is quite difficult to discern any individual elements in the field of view.


Fig. 4 Can you correctly identify these objects? (Choices shown in the figure:
A. Alien Spaceship, B. Stadium Dome, C. Reactor Dome, D. Temple Dome, E. Hard Hat,
F. Flying Saucer.)


Fig.  5  Low  resolution  images  of  the  scene  from  which  the  objects  in  Fig.  4  were  taken, 
where  neither  large  nor  small  objects  are  discernible.  (Left:  16  x  10  image  pixels. 
Right:  32  x  20  image  pixels.) 

Imagine you are in new surroundings and you only have a low resolution (e.g.,
32 x 32 pixels) imaging camera that can be used for wide area coverage. However,
you can change lenses on the camera to narrow your field of view such that a region
of the scene that was previously imaged by a single pixel is now imaged with
32 x 32 pixels (Fig. 6). If applied to every pixel location in the original image
matrix, the image resolution of the wide area view would be increased to
1024 x 1024 pixels. However, this would require much more time and data
collection on your part, and transmitting large amounts of data is often costly, i.e.,
bandwidth limited. Given that you are likely to be more selective in recording the
narrower fields of view, e.g., focusing on an identified area of interest, it would be
helpful to develop some useful strategies to enable rapid and robust scene analysis.
As an example, Warnell et al. recently discussed concepts associated with visual
saliency to enable enhanced camera control for tasks such as automatic navigation
and scene exploration. Saliency estimation, which is a computational identification
of various elements in a scene that are likely to catch the attention of a human
observer, is a valuable tool in image analysis and processing. Nevertheless, future
work may include extending these concepts to interpreting changing (dynamic)
scenes, such as those described above.
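The arithmetic behind the lens-change scenario can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, and the 3 bytes per pixel (uncompressed 8-bit color) is an assumed value that does not appear in the report.

```python
def mosaic_cost(wide_px=(32, 32), narrow_px=(32, 32), bytes_per_pixel=3):
    """Cost of re-imaging every wide-area pixel with a narrow field of view.

    Returns (number of narrow views, full mosaic resolution, raw bytes).
    bytes_per_pixel=3 is an assumption (uncompressed 8-bit color).
    """
    wide_w, wide_h = wide_px
    narrow_w, narrow_h = narrow_px
    n_views = wide_w * wide_h                      # one narrow view per wide-area pixel
    full_res = (wide_w * narrow_w, wide_h * narrow_h)
    total_bytes = full_res[0] * full_res[1] * bytes_per_pixel
    return n_views, full_res, total_bytes


views, resolution, raw_bytes = mosaic_cost()
print(views, resolution, raw_bytes)
```

With the defaults, the sketch reproduces the 1024 x 1024 pixel figure quoted above and shows that 1024 separate narrow views would be needed, which is why a selective strategy is attractive.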


Fig. 6 Framework for follow-on numerical experiments to focus on scene analysis
strategies

Such  a  problem  can  be  investigated  and  characterized  through  a  series  of  human- 
in-the-loop  numerical  experiments.  For  example,  it  may  be  useful  at  first  to  develop 
a  general  object  searching  algorithm  that  focuses  on  the  narrow  fields  of  view  at 
randomly  selected  locations  within  an  identified  area  of  interest.  Then,  one  can 
begin  to  analyze  how  many  randomly  revealed  narrow  fields  of  view  would  be 
needed  to  clearly  identify  specific  object  shapes  and  textures  or  to  capture  the  gist 
of the image scene (as discussed by Torralba). Later on, this approach can be
expanded  using  a  different  or  improved  strategy. 
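One possible shape for such an experiment is sketched below. It is not the report's algorithm: the function names are hypothetical, the area of interest is reduced to a count of cells (one narrow field of view per cell), and the human judgment is replaced by a stand-in rule passed in as a callable.

```python
import random


def views_needed(n_cells, is_identified, seed=0):
    """Reveal narrow fields of view in random order until a judgment succeeds.

    n_cells       -- number of narrow views covering the area of interest
    is_identified -- callable taking the set of revealed cell indices and
                     returning True once the scene or object is recognized
                     (a human observer in a real experiment)
    Returns the number of views revealed, or None if never identified.
    """
    rng = random.Random(seed)
    order = list(range(n_cells))
    rng.shuffle(order)                 # random search order within the area
    revealed = set()
    for count, cell in enumerate(order, start=1):
        revealed.add(cell)
        if is_identified(revealed):
            return count
    return None


def coverage_rule(n_cells, fraction):
    """Stand-in for the human: 'identified' once a fraction of cells is seen."""
    return lambda revealed: len(revealed) >= fraction * n_cells


# Hypothetical run: an 8 x 8 area of interest, identified at 25% coverage.
needed = views_needed(64, coverage_rule(64, 0.25))
print(needed)
```

Replacing `coverage_rule` with recorded human responses, or replacing the random ordering with a saliency-driven one, would be natural next steps for the improved strategies mentioned above.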

Of course, the degree of image resolution needed for a particular task depends on
the analysis or computer vision problem of interest. For example, with regard
to image analysis and labeling, compare the low resolution images in Fig. 5 to the
slightly higher resolution images shown in Fig. 7. When the image resolution is
increased to 64 x 40 pixels and greater, one can more easily identify the layout and
main elements of the image scene, such as the reactor dome and hard hat shown
above. However, if still higher resolution images of this reactor site are analyzed
(Fig. 8), then additional details and information may be gained, for example,
intelligence relating to its operational status. By analyzing the extracted and labeled
objects shown in Fig. 8, one might ask if the reactor site is still under construction
or near completion, as evidenced by the engineers wearing hard hats, the surveyor,
the hoist, and the electrical hazard sign. Note here that the hoist, surveyor, and
engineers wearing hard hats in the far-field of the imaged scene all required
increased resolution, i.e., >32 x 32 pixels, to be clearly identified (visually compare
right vs. left in Fig. 8). Table 1 provides the image resolution details in numbers of
pixels for these labeled objects.


Fig. 7 Same images as shown in Fig. 5 but with slightly higher resolution. Left: 64 x 40
image pixels. Right: 128 x 80 image pixels.


Fig. 8 Same images as shown in Figs. 5 and 7 but with even higher resolution. Left: 525 x
336 image pixels. Right: 3888 x 2492 image pixels. Note that the hoist, surveyor, and engineers
wearing hard hats in the far-field of the imaged scene all required increased image resolution
to be clearly identified.

Table 1 Image resolution information (in numbers of pixels)

Object          Fig. 8 (Left)    Fig. 8 (Right)
Main image      525 x 336        3888 x 2492
Reactor Dome    191 x 51         1028 x 256
Hard Hat        82 x 45          405 x 225
Danger Sign     32 x 36          64 x 69
Hoist           5 x 24           39 x 175
Surveyor        5 x 11           34 x 74
Engineers       5 x 5            32 x 41
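The Table 1 values can be checked against the ">32 x 32 pixels" remark with a few lines of arithmetic. The sketch below reads that remark as a total pixel count above 32 x 32 = 1024, which is an interpretation made here and not stated in the report; the variable and function names are likewise illustrative.

```python
# (width, height) in pixels for each labeled object: (Fig. 8 left, Fig. 8 right)
TABLE_1 = {
    "Main image":   ((525, 336), (3888, 2492)),
    "Reactor Dome": ((191, 51),  (1028, 256)),
    "Hard Hat":     ((82, 45),   (405, 225)),
    "Danger Sign":  ((32, 36),   (64, 69)),
    "Hoist":        ((5, 24),    (39, 175)),
    "Surveyor":     ((5, 11),    (34, 74)),
    "Engineers":    ((5, 5),     (32, 41)),
}

MIN_PIXELS = 32 * 32  # assumed reading of the ">32 x 32 pixels" threshold


def below_threshold(column):
    """Objects whose pixel count does not exceed MIN_PIXELS.

    column -- 0 for Fig. 8 (left), 1 for Fig. 8 (right)
    """
    names = []
    for name, sizes in TABLE_1.items():
        width, height = sizes[column]
        if width * height <= MIN_PIXELS:
            names.append(name)
    return names


print("left: ", below_threshold(0))
print("right:", below_threshold(1))
```

Under this reading, only the hoist, surveyor, and engineers fall below the threshold in the left image, and none do in the right image, which agrees with the observation in the text.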

In this section, we have shown an example of varying image resolution as it relates
to detailed analysis and labeling of this outdoor scene. In follow-on research, we
can similarly explore the impact of varying time-scale-related image resolution on
scene analysis using sequences of recorded images.

3.  Image  Data  Input 

There are several key pieces of information that can be identified as new image data
are being recorded that are important and accessible, but usually overlooked or left
undocumented. For example, one can readily identify a timestamp, the global
positioning system (GPS) position, the prevailing environmental and weather
conditions, the field of view, depth of view, and image resolution, as noted above.
Table 2 provides a list of several space and time scale dependent elements that can
be incorporated with measured image data. The first group focuses on
environmental effects, such as the weather, cloud cover, ground and road
conditions, and visibility. Identifying environmental conditions is an effective way
to categorize diverse data sets of image scenes for later use and analysis. For
example, if an end user later needs to find image data with a certain
resolution in rainy or low-light conditions, then incorporating the space-time
related elements listed in Table 2 at the time of recording makes that search
possible.

In addition, changing environmental conditions can affect image contrast and
resolution due to weather events and changes in visibility or cloud cover, and these
effects often coincide with lighting changes in an imaged scene, e.g., those due to
increased scattering and attenuation of light in adverse weather conditions. Time
of day and sun angle information can also be useful to highlight conditions when
increased glare, shadows, or silhouettes can cause difficulties for image analysis
and computer vision related processes. Taking note of fog, smoke, obscurants,
and optical turbulence conditions is also important because these effects can
significantly degrade image quality due to spatial smearing of shapes, textures, and
moving elements in an imaged scene. Similarly, identifying ground and road
conditions, e.g., wet or dry, icy, sand, or gravel, can be used in subsequent image
analyses or can support autonomous systems with regard to navigation tasks,
tracking personnel, or detecting vehicles.

The second group in Table 2 lists elements related to the image data measurements
themselves, e.g., the spatial and temporal image resolutions, field of view, and
depth of view. Together with the environmental effects, these data can be used as a
basic building block for the analysis of changing image scenes.

Table 2 Incorporating key space and time scale related information as image data are
being collected

Environmental Effects
  Weather conditions (rain, snow, haze, fog, or hail)
  Sun angle, sky, and cloud cover
  Ground/road conditions (dry, wet, icy, sand, gravel, rocky, etc.)
  Visibility (fog, smoke, obscurants, or optical turbulence)

Image Data Measurements
  GPS position
  Timestamp (relative to sun angle or relative to a world clock)
  Image resolution
  Pixel size and pixel separation
  Field of view and depth of view
  Shutter exposure time and time interval between image frames
  Time over which images are captured in a sequence
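One way the Table 2 elements might be attached to each recorded image is sketched below as a small record type with a search helper, mirroring the earlier example of an end user looking for data of a certain resolution in rainy conditions. The class, field names, units, and sample values are all hypothetical; the report does not prescribe a storage format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ImageRecord:
    """Hypothetical per-image metadata record based on Table 2."""
    # Environmental effects
    weather: str                         # e.g., "rain", "snow", "haze", "fog", "clear"
    sky: str                             # sun angle, sky, and cloud cover notes
    ground: str                          # e.g., "dry", "wet", "icy", "sand", "gravel"
    visibility: str                      # e.g., "clear", "fog", "smoke"
    # Image data measurements
    gps_position: Tuple[float, float]    # (latitude, longitude), degrees (assumed)
    timestamp: str                       # world-clock time as text (assumed format)
    resolution_px: Tuple[int, int]       # (width, height)
    exposure_s: float                    # shutter exposure time
    frame_interval_s: Optional[float] = None
    sequence_duration_s: Optional[float] = None
    field_of_view_deg: Optional[float] = None
    depth_of_view_m: Optional[float] = None
    pixel_size_um: Optional[float] = None


def find_records(records: List[ImageRecord], weather: str,
                 min_resolution_px: Tuple[int, int]) -> List[ImageRecord]:
    """Return records with the given weather and at least the given resolution."""
    min_w, min_h = min_resolution_px
    return [r for r in records
            if r.weather == weather
            and r.resolution_px[0] >= min_w
            and r.resolution_px[1] >= min_h]


# Made-up sample records for illustration only.
sample = [
    ImageRecord("rain", "overcast", "wet", "clear", (0.0, 0.0),
                "2000-01-01T12:00:00Z", (128, 80), 0.01),
    ImageRecord("clear", "sunny", "dry", "clear", (0.0, 0.0),
                "2000-01-01T13:00:00Z", (525, 336), 0.002),
    ImageRecord("rain", "overcast", "wet", "fog", (0.0, 0.0),
                "2000-01-01T14:00:00Z", (32, 20), 0.01),
]
rainy_hits = find_records(sample, "rain", (64, 40))
print(len(rainy_hits))
```

Because the environmental fields are recorded alongside the measurement fields, a query like this needs no image processing at all, which is the practical benefit of capturing the information at recording time.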


4.  Image  Motion  Characterization 

Polana and Nelson suggested that image motion in a scene can be categorized into
3 groups. The first group consists of motions having statistical regularities, i.e., they
are repeatable in both space and time, such as the action of water waves or the
motion of clouds, trees, and leaves. In contrast, the second group consists of
activities, repeatable over time but not over space, such as people walking, biking,
or talking. The third group includes motion events that are not repeatable in either
space or time, such as a person throwing a ball or entering a room. According to
Laptev, such events correspond to features in the image scene appearing or
disappearing and to non-constant motion, which often corresponds to changes or
discontinuities in velocity and acceleration. An alternate method for motion event
analysis based on visual attention and temporal salience has been discussed by
Thomas.

In  follow-on  research,  we  can  develop  additional  numerical  algorithms  and 
experiments  that  can  be  implemented  using  new  or  existing  data  sets  to  help 
recognize  the  motion  of  individual  objects  in  a  scene  and  mitigate  any  undesirable 
artifacts due to camera motion. Briefly, we can express the optical flow (i.e.,
image gradient) velocity (u) of a moving object in a scene as

$$\mathbf{u} = \frac{\Delta \mathbf{s}}{\Delta t} = \mathbf{v}(\mathbf{x}, t_m) + \mathbf{w}, \tag{1}$$

where v is the velocity of an individual element in the image, which is a function
of position, x (with components x1, x2, x3), and time (of successive frames), t_m = t1, t2, t3, ...

Also  in  Eq.  1,  w  is  the  velocity  of  the  camera  motion  from  one  image  frame  to  the 
next,  which  in  some  cases  is  considered  a  constant.  Next,  we  define  the  divergence 
of  the  optical  flow  velocity  as 


$$\nabla \cdot \mathbf{u} = \frac{\Delta u_1}{\Delta x_1} + \frac{\Delta u_2}{\Delta x_2} + \frac{\Delta u_3}{\Delta x_3}, \tag{2}$$

and realize from Eqs. 1 and 2 that if one adds a constant (w) to the velocity field,
then the divergence of the velocity field (∇·u) remains unchanged. We can also
explore the optical flow acceleration (a) and its divergence (∇·a) in a similar
manner.  We  anticipate  that  our  research  results  will  provide  many  useful  insights 
toward  developing  novel  strategies  for  the  analysis  of  space  and  time  varying 
scenes. 
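The observation that a constant camera velocity drops out of the divergence can be checked numerically. The sketch below is a minimal illustration, not the proposed algorithm: it assumes a 2-component flow field on a regular grid (the image plane only), uses forward differences for the Δ terms, and uses made-up field values and helper names.

```python
def divergence(u1, u2, dx1=1.0, dx2=1.0):
    """Forward-difference divergence of a 2-D flow field.

    u1, u2 -- nested lists indexed [row][col]; u1 is the component along
              columns (x1) and u2 the component along rows (x2)
    Returns a (rows-1) x (cols-1) nested list of Δu1/Δx1 + Δu2/Δx2.
    """
    rows, cols = len(u1), len(u1[0])
    return [[(u1[i][j + 1] - u1[i][j]) / dx1 + (u2[i + 1][j] - u2[i][j]) / dx2
             for j in range(cols - 1)]
            for i in range(rows - 1)]


def add_constant(field, c):
    """Add the same constant to every grid point (uniform camera motion)."""
    return [[value + c for value in row] for row in field]


# Made-up object flow v on a 3 x 3 grid.
v1 = [[j * j + i for j in range(3)] for i in range(3)]
v2 = [[2 * i * j for j in range(3)] for i in range(3)]

# Hypothetical constant camera velocity w = (5, -3) added to every point.
u1 = add_constant(v1, 5)
u2 = add_constant(v2, -3)

div_v = divergence(v1, v2)
div_u = divergence(u1, u2)
print(div_v == div_u)
```

The differences in Eq. 2 subtract neighboring values, so any term that is the same at every grid point cancels. The same check would fail for camera motion that varies across the image (e.g., rotation or zoom), which is why the constant-w assumption matters.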


5.  Summary  and  Conclusions 

In  this  report,  I  began  to  explore  the  space  and  time  scale  aspects  of  image  data  as 
they  are  related  to  the  measurement  and  analysis  of  changing  image  scenes,  and 
whether  scene  variations  are  due  to  environmental  conditions  or  the  motion  of 
objects  within  the  field  of  view  or  both.  I  showed  an  example  that  demonstrated  the 
impact  of  varying  image  resolution  on  the  detailed  analysis  and  labeling  of  an 
outdoor  scene.  I  also  provided  a  list  of  several  space  and  time  scale  dependent 
elements  that,  if  incorporated  at  the  start  of  the  image  data  measurement  process, 
can  provide  an  end  user  with  a  better  organized,  top-down  approach  to  determine 
what analysis or computer vision tasks are feasible with the available data. Finally,
I discussed image motion characterization and proposed a follow-on research study
to develop numerical algorithms and experiments to explore and analyze changing
image scenes with new or existing data sets. I anticipate that this research will help
to advance Army relevant technologies in scene understanding for enhanced robot
autonomy.